We will work on scRNA-seq data of mouse gastrulation and early organogenesis from Pijuan-Sala et al., 2019. This Shiny application provides an interactive interface that allows users to validate their own analyses of these data. You can access the original data and related scripts on the GitHub page.
Gastrulation is a phase early in the embryonic development of most animals, during which the single-layered blastula is reorganized into a multilayered structure known as the gastrula (Wikipedia, 2020-06-02).
Let’s start by loading the required libraries.
suppressPackageStartupMessages({
library(Seurat)
library(ggplot2)
library(dplyr)
library(data.table)
library(cowplot)
set.seed(1)
})
In the previous practical, we learned the essential preprocessing steps for working with scRNA-seq data. Here we will start from the preprocessed version of the mouse gastrulation scRNA-seq data, loading the pre-saved Seurat object.
Warning: The pre-processed mouse gastrulation scRNA-seq Seurat object is large (~2 GB). In today’s practical, we will subset the data for fast computation. You can, however, work on the original data if you want to practice what we learned today. The metadata is provided separately.
mgsc <- readRDS('data/gastrulation/mgsc.rds')
mgsc
## An object of class Seurat
## 29452 features across 116312 samples within 1 assay
## Active assay: RNA (29452 features, 0 variable features)
Reminder: number of rows = genes = features, and number of columns = cells = samples
# let's see the available slots
slotNames(mgsc)
## [1] "assays" "meta.data" "active.assay" "active.ident" "graphs"
## [6] "neighbors" "reductions" "project.name" "misc" "version"
## [11] "commands" "tools"
Now let’s look at the metadata table of this dataset, which contains an overview of the samples.
# see it is empty for now
head(mgsc@meta.data, 5)
## data frame with 0 columns and 5 rows
# now let's load the metadata
metadata <- fread('data/gastrulation/sample_metadata.txt.gz') %>% .[stripped==FALSE & doublet==FALSE]
# and add the metadata to our seurat object
mgsc <- AddMetaData(mgsc, metadata = data.frame(metadata, row.names = metadata$cell))
# now let's see once more
head(mgsc@meta.data, 5)
## cell barcode sample stage sequencing.batch doublet stripped
## cell_1 cell_1 AAAGGCCTCCACAA 1 E6.5 1 FALSE FALSE
## cell_2 cell_2 AACAAACTCGCCTT 1 E6.5 1 FALSE FALSE
## cell_5 cell_5 AACAGAGAATCAGC 1 E6.5 1 FALSE FALSE
## cell_6 cell_6 AACATATGAATCGC 1 E6.5 1 FALSE FALSE
## cell_8 cell_8 AACCGATGGCTTCC 1 E6.5 1 FALSE FALSE
## celltype umapX umapY celltype2 celltype3
## cell_1 Epiblast -10.227546 -2.8816875 Epiblast Epiblast-PS
## cell_2 Primitive_Streak -6.625458 0.1089605 Primitive_Streak Epiblast-PS
## cell_5 ExE_ectoderm 10.061009 -0.0293132 ExE_ectoderm ExE_ectoderm
## cell_6 Epiblast -10.454418 -0.2694517 Epiblast Epiblast-PS
## cell_8 Epiblast -11.047206 -2.2052687 Epiblast Epiblast-PS
Now we are able to see the annotations (e.g. cell type) for each cell. There is also a column named stage, which shows the embryonic day at which the cells were sequenced. To speed up our experiments, we will work on a subset of cells that belong to stage E6.75.
mgsc_subset <- mgsc[ , mgsc@meta.data$stage=='E6.75']
mgsc_subset
## An object of class Seurat
## 29452 features across 2075 samples within 1 assay
## Active assay: RNA (29452 features, 0 variable features)
# now let's save this subset
saveRDS(mgsc_subset, file = "data/gastrulation/mgsc_e675.rds")
mgsc <- mgsc_subset
This is where you are starting from! :) Now you can load the mgsc_e675.rds object to start working! Make sure that the object you loaded has the same numbers of rows and columns as the E6.75 Seurat object.
Q1: Let’s warm up! Can you subset the cells at positions \(1, 23, 515\) and the genes at positions \(21, 44, 116\), respectively?
Answer
s1 <- mgsc[, c(1,23,515)] # Subset to cells 1, 23, 515
s2 <- mgsc[c(21,44,116), ] # Subset to genes 21, 44, 116
s1
## An object of class Seurat
## 29452 features across 3 samples within 1 assay
## Active assay: RNA (29452 features, 0 variable features)
s2
## An object of class Seurat
## 3 features across 2075 samples within 1 assay
## Active assay: RNA (3 features, 0 variable features)
We can use rownames and colnames to see the names of genes and cells.
# rownames = gene names
head(rownames(mgsc))
## [1] "ENSMUSG00000051951" "ENSMUSG00000089699" "ENSMUSG00000102343"
## [4] "ENSMUSG00000025900" "ENSMUSG00000025902" "ENSMUSG00000104328"
# colnames = sample/cell names
head(colnames(mgsc))
## [1] "cell_5456" "cell_5457" "cell_5458" "cell_5459" "cell_5460" "cell_5461"
Q2: Observe the count data of the first 15 cells for genes “ENSMUSG00000051951” and “ENSMUSG00000033845”. (Hint: use the GetAssayData function.)
Answer
GetAssayData(object = mgsc)[c("ENSMUSG00000051951", "ENSMUSG00000033845"), 1:15]
## 2 x 15 sparse Matrix of class "dgCMatrix"
## [[ suppressing 15 column names 'cell_5456', 'cell_5457', 'cell_5458' ... ]]
##
## ENSMUSG00000051951 . . . . . .
## ENSMUSG00000033845 1.590889 3.265748 3.693841 2.28663 3.645026 2.568868
##
## ENSMUSG00000051951 . . . . . .
## ENSMUSG00000033845 2.251156 2.997522 3.052133 2.881758 2.380428 3.334173
##
## ENSMUSG00000051951 . . .
## ENSMUSG00000033845 1.34505 2.986037 2.199593
# mgsc[c("ENSMUSG00000051951", "ENSMUSG00000033845"), 1:15]@assays$RNA@counts
The . values in the matrix represent \(0\)s. Since most values in an scRNA-seq matrix are \(0\), Seurat uses a sparse-matrix representation to speed up calculations and reduce memory usage.
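As a small, self-contained illustration (independent of our dataset) of why this matters: the Matrix package's dgCMatrix class stores only the non-zero entries, so for a matrix that is mostly zeros it takes a fraction of the memory of the dense representation.

```r
library(Matrix)

# build a 1000 x 1000 matrix that is ~99% zeros
set.seed(1)
dense <- matrix(0, nrow = 1000, ncol = 1000)
dense[sample(length(dense), 10000)] <- rpois(10000, lambda = 2)

sparse <- Matrix(dense, sparse = TRUE)  # converts to a dgCMatrix

# the dense matrix stores every entry, zero or not;
# the sparse matrix stores only the ~1% non-zero entries
object.size(dense)
object.size(sparse)
```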
Q3: Our feature set (i.e. the number of genes) is quite large. What kind of features do you think should be kept in the dataset?
Seurat implements FindVariableFeatures to model the mean-variance relationship in single-cell data; it selects 2000 variable features by default. We will set the nfeatures argument to reduce the dimensionality to 1000.
mgsc <- FindVariableFeatures(mgsc, selection.method = "vst", nfeatures = 1000)
options(repr.plot.width=12, repr.plot.height=6)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(mgsc), 10)
# plot variable features
plot1 <- VariableFeaturePlot(mgsc)
LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)
## Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.
## Please use `as_label()` or `as_name()` instead.
## This warning is displayed once per session.
## Warning: Transformation introduced infinite values in continuous x-axis
head(HVFInfo(mgsc)[VariableFeatures(mgsc), ], 5)
## mean variance variance.standardized
## ENSMUSG00000032083 10.008193 971.0756 24.49617
## ENSMUSG00000024990 3.765301 210.4092 22.38176
## ENSMUSG00000061808 7.784578 546.4217 21.09222
## ENSMUSG00000024503 6.853976 419.2819 19.63029
## ENSMUSG00000024391 3.319036 112.9888 15.76248
Q4: Seurat implements the ScaleData function to scale the data. Do you remember why we need this step?
mgsc <- ScaleData(mgsc) #Vector of features names to scale/center. Default is variable features.
## Centering and scaling data matrix
The results are stored in mgsc[["RNA"]]@scale.data.
mgsc[["RNA"]]@scale.data[1:5,1:5]
## cell_5456 cell_5457 cell_5458 cell_5459 cell_5460
## ENSMUSG00000025902 -0.3250561 -0.3250561 -0.3250561 -0.3250561 -0.3250561
## ENSMUSG00000042182 -0.0428776 -0.0428776 -0.0428776 -0.0428776 -0.0428776
## ENSMUSG00000026124 -0.3533363 -0.3533363 -0.3533363 -0.3533363 -0.3533363
## ENSMUSG00000008136 -0.3580923 -0.3580923 -0.3580923 -0.3580923 -0.3580923
## ENSMUSG00000025993 -0.3715563 -0.3715563 -0.3715563 -0.3715563 -0.3715563
# OR
# mgsc@assays$RNA@scale.data[1:5,1:5]
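To convince yourself of what ScaleData did, a quick sanity check (a sketch, not part of the original practical) is to z-score one variable gene by hand from the normalized data and compare it to the stored values. By default ScaleData centers and scales each gene to mean 0 and unit variance and clips extreme values at scale.max = 10, so the hand-computed values should match after the same clipping.

```r
# pick one of the variable features (these are what ScaleData scaled by default)
g <- VariableFeatures(mgsc)[1]

# z-score the normalized expression of this gene by hand
x <- GetAssayData(mgsc, slot = "data")[g, ]
manual <- (x - mean(x)) / sd(x)

# compare to Seurat's stored values (identical up to the scale.max = 10 clipping)
stored <- mgsc[["RNA"]]@scale.data[g, ]
all.equal(unname(pmin(manual, 10)), unname(stored))
```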
Let’s use Seurat’s RunPCA function. Pay attention to the features argument: how many features will PCA work on?
mgsc <- RunPCA(mgsc, npcs = 100, features = VariableFeatures(object = mgsc))
We can explore the constructed embeddings via mgsc@reductions.
slotNames(mgsc@reductions[["pca"]])
## [1] "cell.embeddings" "feature.loadings"
## [3] "feature.loadings.projected" "assay.used"
## [5] "global" "stdev"
## [7] "key" "jackstraw"
## [9] "misc"
dim(mgsc@reductions[["pca"]]@cell.embeddings)
## [1] 2075 100
mgsc@reductions[["pca"]]@cell.embeddings[1:5,1:5]
## PC_1 PC_2 PC_3 PC_4 PC_5
## cell_5456 -7.525323 -2.044025 -1.7048600 -3.297759 -0.00652879
## cell_5457 -8.266647 -3.636625 0.6928462 -3.041104 -0.18143907
## cell_5458 -7.687170 -3.591576 -2.1696897 -3.157680 -0.03483741
## cell_5459 -7.256699 -2.657866 -1.7883639 -3.313753 -1.10164595
## cell_5460 -7.410198 -2.796104 -1.9718807 -4.048726 -0.79348906
# OR
# Embeddings(mgsc, reduction = "pca")[1:5, 1:5]
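As a sketch of what RunPCA stores: the cell embeddings are, up to the arbitrary sign of each PC and numerical precision, the scaled data projected onto the feature loadings, so we can reconstruct the scores ourselves.

```r
loadings <- Loadings(mgsc, reduction = "pca")  # genes x PCs
scaled   <- mgsc[["RNA"]]@scale.data           # scaled genes x cells

# project the scaled expression of the variable features onto the loadings
manual_pc <- t(scaled[rownames(loadings), ]) %*% loadings

# compare with what Seurat stored (PC signs are arbitrary)
head(manual_pc[, 1:2])
head(Embeddings(mgsc, "pca")[, 1:2])
```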
Q5: Do you remember what kind of information feature.loadings store?
# (1) we can visualize top genes associated with pca embeddings
VizDimLoadings(mgsc, dims = 1:2, reduction = "pca")
VizDimLoadings(mgsc, dims = 3:4, reduction = "pca")
Q6: Name the method that we used to determine the dimensionality. Let’s try to find out how much of the variance is explained by each PC. (Hint: check slotNames(mgsc@reductions[["pca"]]); which slot might help you with this?)
Answer
percent <- 100*mgsc@reductions[["pca"]]@stdev^2/sum(mgsc@reductions[["pca"]]@stdev^2)
perc_data <- data.frame(percent=percent, PC=1:length(percent))
ggplot(perc_data, aes(x=PC, y=percent)) +
geom_line(aes(y= percent), linetype=2) +
scale_x_continuous(breaks=seq(from=0, to=80, by= 10)) +
geom_text(aes(label=round(percent, 2)), size=2, vjust=-.5) +
ylim(0, 80)
# OR you can directly use Seurat function
ElbowPlot(mgsc, ndims = 80)
Q7: Now it’s time to visualize our data! Generate the following output: three 2D plots covering the first six PCs, printed side by side, i.e. PC1-PC2, PC3-PC4, PC5-PC6. (Hint: Utilize the DimPlot and plot_grid functions.)
Answer
# (2)
plot_grid(ncol = 3,
DimPlot(mgsc, reduction = "pca", dims = 1:2) + theme(legend.position="none"),
DimPlot(mgsc, reduction = "pca", dims = 3:4) + theme(legend.position="none"),
DimPlot(mgsc, reduction = "pca", dims = 5:6) + theme(legend.position="none") )
Seurat provides RunTSNE, which uses the Rtsne library (which we previously worked with in the lecture).
# we are using the PCA embeddings as our input
mgsc <- RunTSNE(mgsc, reduction="pca", dims = 1:30, nthreads = 4, max_iter = 2000)
mgsc@reductions
## $pca
## A dimensional reduction object with key PC_
## Number of dimensions: 100
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: TRUE
## Computed using assay: RNA
##
## $tsne
## A dimensional reduction object with key tSNE_
## Number of dimensions: 2
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: RNA
DimPlot(mgsc, reduction = "tsne")
Q8: Do you remember one of the important hyper-parameters of t-SNE? Print t-SNE plots using DimPlot and explore how the structure of the plotted data changes with this parameter. Keep the number of dimensions at \(30\). (Hint: reduction.name will help you save the embeddings produced by different configurations.)
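One possible sketch for Q8, assuming the hyper-parameter meant is perplexity (the particular values chosen here are arbitrary): run t-SNE several times, saving each result under its own reduction.name, then plot them side by side.

```r
# rerun t-SNE with different perplexities, keeping each embedding
for (p in c(5, 30, 100)) {
  mgsc <- RunTSNE(mgsc, reduction = "pca", dims = 1:30, perplexity = p,
                  reduction.name = paste0("tsne_p", p),
                  reduction.key  = paste0("tSNEp", p, "_"))
}

plot_grid(ncol = 3,
  DimPlot(mgsc, reduction = "tsne_p5")   + ggtitle("perplexity = 5"),
  DimPlot(mgsc, reduction = "tsne_p30")  + ggtitle("perplexity = 30"),
  DimPlot(mgsc, reduction = "tsne_p100") + ggtitle("perplexity = 100"))
```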
mgsc <- RunUMAP(mgsc, dims = 1:30)
DimPlot(mgsc, reduction = "umap")
Homework: How about the important parameters for UMAP? Print the following UMAP plots using DimPlot and explore how the structure of the visualized data changes with these two parameters. Choose dimensions between \(1:30\).
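A possible sketch for this homework, assuming the two parameters meant are n.neighbors and min.dist (the usual UMAP knobs in Seurat); the grid of values is an arbitrary choice.

```r
# rerun UMAP over a small grid of n.neighbors and min.dist values
for (nn in c(5, 30)) {
  for (md in c(0.01, 0.5)) {
    name <- paste0("umap_n", nn, "_d", md)
    mgsc <- RunUMAP(mgsc, dims = 1:30, n.neighbors = nn, min.dist = md,
                    reduction.name = name,
                    reduction.key  = paste0("UMAPn", nn, "d", gsub("\\.", "", md), "_"))
  }
}

plot_grid(ncol = 2,
  DimPlot(mgsc, reduction = "umap_n5_d0.01")  + ggtitle("n.neighbors = 5, min.dist = 0.01"),
  DimPlot(mgsc, reduction = "umap_n5_d0.5")   + ggtitle("n.neighbors = 5, min.dist = 0.5"),
  DimPlot(mgsc, reduction = "umap_n30_d0.01") + ggtitle("n.neighbors = 30, min.dist = 0.01"),
  DimPlot(mgsc, reduction = "umap_n30_d0.5")  + ggtitle("n.neighbors = 30, min.dist = 0.5"))
```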
Homework: Plot the projections of the dataset with three of the dimensionality reduction techniques printed side by side and explore AugmentPlot.
For our subset of time point E6.75, we have the cell-type annotation information. This is not usually the case: in many computational problems we start without knowing (or with only a vague idea of) how many clusters we will end up with. For now, let’s take advantage of the known annotations.
Q9: How many \(real\) classes do we have? Reproduce the following plots colored by cell type. (Hint: You can make use of the group.by argument in DimPlot to extract stored cluster IDs.)
## [1] "Epiblast" "Primitive_Streak"
## [3] "ExE_endoderm" "Visceral_endoderm"
## [5] "ExE_ectoderm" "Nascent_mesoderm"
## [7] "Anterior_Primitive_Streak" "PGC"
## [9] "Mixed_mesoderm" "Rostral_neurectoderm"
## [11] "Surface_ectoderm" "Def._endoderm"
## [13] "Haematoendothelial_progenitors"
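A sketch for Q9, assuming the annotation lives in the celltype column of the metadata loaded earlier: the number of classes is the number of unique cell types, and DimPlot can color the embeddings by that column directly.

```r
# how many annotated cell types are present in this subset?
length(unique(mgsc@meta.data$celltype))

# color the embeddings by the stored annotation instead of the active identity
plot_grid(ncol = 2,
  DimPlot(mgsc, reduction = "tsne", group.by = "celltype"),
  DimPlot(mgsc, reduction = "umap", group.by = "celltype"))
```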
Seurat does not support K-means? What are we going to do now!? :)
Q10: What do we need to run \(K\)-means? Try to reproduce the following plot, where the true classes are compared to the k-means clusters (\(k=13\)). (Hint: Make use of the Embeddings function or the reductions we used in the previous examples. You can save the cluster IDs to the mgsc@meta.data field.)
## [1] 2075 30
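A sketch for Q10: base R’s kmeans needs a plain numeric matrix, which Embeddings provides; we then store the cluster IDs in the metadata so DimPlot can color by them. The plotting layout and k-means settings here are assumptions.

```r
set.seed(1)
emb <- Embeddings(mgsc, reduction = "pca")[, 1:30]  # cells x 30 PCs
km  <- kmeans(emb, centers = 13, nstart = 25)

# save the cluster IDs in the metadata so DimPlot can use them via group.by
mgsc@meta.data$kmeans13 <- factor(km$cluster)

plot_grid(ncol = 2,
  DimPlot(mgsc, reduction = "tsne", group.by = "celltype") + ggtitle("true classes"),
  DimPlot(mgsc, reduction = "tsne", group.by = "kmeans13") + ggtitle("k-means, k = 13"))
```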
Homework: Explore how clusters would have changed if you used t-SNE embeddings instead of PCA as an input to k-means algorithm.
## [1] 2075 2
We already know the number of cell types in our data. But this is not always the case. How would we decide the number of clusters if we didn’t know the number of classes? Which output of \(k\)-means did we look at to decide the number of clusters?
Homework: Print the following plot using the same embeddings from the previous example. (Hint: Lecture notes:))
Can we decide the number of clusters looking at this plot?
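One sketch for this homework: the quantity usually examined is the total within-cluster sum of squares (tot.withinss in the kmeans output), plotted against k to look for an elbow. The range of k values here is an arbitrary choice.

```r
set.seed(1)
emb <- Embeddings(mgsc, reduction = "pca")[, 1:30]

# total within-cluster sum of squares for a range of k
ks  <- 1:20
wss <- sapply(ks, function(k) kmeans(emb, centers = k, nstart = 10)$tot.withinss)

ggplot(data.frame(k = ks, wss = wss), aes(k, wss)) +
  geom_line(linetype = 2) + geom_point() +
  labs(y = "total within-cluster sum of squares")
```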
Seurat does not support hierarchical clustering either? But we can do it ourselves!
Q11: Which elements do we need to perform hierarchical clustering? Try to reproduce the following dendrogram. Use Euclidean as the distance metric and ward.D2 as the linkage method.
Looking at the dendrogram, we can investigate different numbers of clusters using the cutree function, which cuts the hierarchical clustering tree into a given number of clusters.
Q12: How about we print the different clustering outcomes for \(k=4, 8, 16\) using t-SNE?
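A sketch for Q11 and Q12: hierarchical clustering needs a distance matrix over the cells, which we compute in PCA space; we build the tree with ward.D2 linkage and cut it at several values of k. The plotting layout and metadata column names are assumptions.

```r
emb <- Embeddings(mgsc, reduction = "pca")[, 1:30]

# hierarchical clustering: Euclidean distances, Ward linkage
d  <- dist(emb, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
plot(hc, labels = FALSE, main = "ward.D2 dendrogram")

# cut the tree at different numbers of clusters and plot on t-SNE
for (k in c(4, 8, 16)) {
  mgsc@meta.data[[paste0("hclust", k)]] <- factor(cutree(hc, k = k))
}
plot_grid(ncol = 3,
  DimPlot(mgsc, reduction = "tsne", group.by = "hclust4"),
  DimPlot(mgsc, reduction = "tsne", group.by = "hclust8"),
  DimPlot(mgsc, reduction = "tsne", group.by = "hclust16"))
```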
Seurat v3 adopts a graph-based clustering methodology similar to the PhenoGraph approach we discussed earlier.
FindNeighbors function performs the following steps:
* (1) build a kNN graph using Euclidean distance in PCA space
* (2) update the edge weights based on the shared neighbors (Jaccard index) to construct a Shared Nearest Neighbor (SNN) graph.
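The Jaccard re-weighting in step (2) can be sketched on a toy example: two cells get a strong edge if they share many of their k nearest neighbors.

```r
# Jaccard index between two neighbor sets: |intersection| / |union|
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# hypothetical neighbor sets (indices of each cell's k nearest neighbors)
nn_cell1 <- c(2, 3, 4, 5, 6)
nn_cell2 <- c(1, 3, 4, 7, 8)

jaccard(nn_cell1, nn_cell2)  # shared neighbors {3, 4} out of 8 unique -> 0.25
```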
Once the graph is constructed, we can use the FindClusters function to apply a community detection algorithm that identifies subgroups. Keep in mind that Seurat uses the Louvain algorithm by default for this step.
mgsc <- FindNeighbors(mgsc, k.param = 20, dims = 1:30, reduction = "pca")
mgsc <- FindClusters(mgsc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2075
## Number of edges: 99889
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8377
## Number of communities: 4
## Elapsed time: 0 seconds
# Look at cluster IDs of the first 20 cells
head(Idents(mgsc), 20)
## cell_5456 cell_5457 cell_5458 cell_5459 cell_5460 cell_5461 cell_5462 cell_5463
## 0 0 0 0 0 0 1 0
## cell_5465 cell_5466 cell_5467 cell_5468 cell_5469 cell_5470 cell_5471 cell_5472
## 0 1 0 3 0 0 3 3
## cell_5473 cell_5474 cell_5475 cell_5476
## 2 3 3 0
## Levels: 0 1 2 3
FindClusters has a resolution parameter that sets the granularity of the downstream clustering, with increased values leading to a greater number of clusters.
The Seurat authors suggest that granularity values between 0.4 and 1.2 usually return good results for single-cell datasets of around 3K cells; the optimal resolution often increases for larger datasets.
Homework: Explore how different values of \(granularity\) affect the clustering results and reproduce the following plot. (Hint: Use Seurat’s Idents function to save the cluster IDs.)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2075
## Number of edges: 99889
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8658
## Number of communities: 4
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2075
## Number of edges: 99889
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7710
## Number of communities: 6
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2075
## Number of edges: 99889
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.7075
## Number of communities: 9
## Elapsed time: 0 seconds
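A sketch for the granularity homework: rerun FindClusters at several resolutions, save each result from Idents into the metadata, then compare the plots. The resolution values and column names here are arbitrary choices.

```r
# cluster at several resolutions, keeping each labelling
for (res in c(0.1, 0.5, 1.0)) {
  mgsc <- FindClusters(mgsc, resolution = res)
  mgsc@meta.data[[paste0("louvain_res", res)]] <- Idents(mgsc)
}

plot_grid(ncol = 3,
  DimPlot(mgsc, reduction = "tsne", group.by = "louvain_res0.1") + ggtitle("resolution = 0.1"),
  DimPlot(mgsc, reduction = "tsne", group.by = "louvain_res0.5") + ggtitle("resolution = 0.5"),
  DimPlot(mgsc, reduction = "tsne", group.by = "louvain_res1")   + ggtitle("resolution = 1.0"))
```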